This report presents a comprehensive analysis of the job market prospects for data analytics and related roles in 2024. Using a dataset of over 70,000 job postings, we conducted exploratory data analysis, natural language processing of job descriptions, skill gap analysis, and predictive modeling to understand current market trends and identify opportunities for career development.
Key Findings:
SAP Applications, Oracle Cloud, and Microsoft Office are the most in-demand software skills
Data Analysis capabilities appear in over 25,000 job postings
Significant skill gaps exist in enterprise software (SAP, Oracle) and cloud platforms (AWS, Azure)
Machine learning models achieved XX% accuracy in predicting job categories
Salary predictions show strong correlation with experience level and technical skill proficiency
---
title: "Job Market Analysis 2024 - Business Analytics, Data Science, and Machine Learning Trends"
author:
  - name: Tuba Anwar
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
  - name: Kriti Singh
    affiliations:
      - ref: bu
  - name: Soham Deshkhaire
    affiliations:
      - ref: bu
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
    code-overflow: wrap
    embed-resources: true
    css: styles.css
bibliography: references.bib
---
3 Introduction
3.1 Project Overview
This research explores the evolving landscape of Business Analytics, Data Science, and Machine Learning (ML) careers in 2024. As organizations increasingly integrate AI into every function, from operations to strategic decision-making, the demand for professionals with strong data and analytics skills is growing rapidly. This study explores four interconnected questions: the most sought-after skills for analytics and ML roles, how job descriptions are changing to require AI literacy, which industries are leading data-driven hiring, and what the short- to medium-term career outlook looks like for analytics professionals. By combining insights from labor market data and recent academic research, this project provides a holistic view of how AI and automation are transforming workforce expectations and employability strategies for students and professionals.
4 Research Rationale
AI, ML and data analytics have changed the global job market, creating both opportunities and challenges for new professionals. Organizations are looking for people with strong technical skills such as Python, SQL and ML modeling along with soft skills like communication, ethical reasoning and problem framing. Understanding these changing expectations is important for job seekers so they can match their academic learning and career plans with what the market is demanding.
This research aims to map out the connection between AI adoption, skill demand and career pathways. Examining these trends can help predict how automation, cloud computing, and generative AI will affect job prospects. Additionally, identifying leading sectors such as finance, healthcare and consulting in analytics hiring will clarify future opportunities and guide both individual career planning and institutional curriculum development.
5 Research Questions
Our analysis addresses the following core questions:
What are the most in-demand technical and soft skills for data science, business analytics, and ML roles in 2024?
How have job descriptions evolved to incorporate AI/ML requirements and competencies?
Which industries are hiring the most data scientists and analytics professionals, and why?
What is the realistic career outlook for business analytics professionals in the short to medium term?
6 Literature Review
Recent research highlights how analytics and ML skills are becoming increasingly central across industries. Job postings now demand expertise across the full data lifecycle, from engineering and modeling to interpretability, governance, and ethical considerations (J. Liu, Chen, and Lyu 2024; Y. Liu and Li 2024). AI and ML proficiency is now a core requirement for business analytics, signaling a blending of traditional analytics and data science roles. Industries such as finance, healthcare, and technology are leading this shift, relying on predictive and prescriptive analytics to drive strategy and innovation (Smaldone et al. 2022).
The literature also points to an optimistic but dynamic career outlook. Georgieff and Hyee (2022) found that AI adoption does not necessarily displace analytics jobs but rather increases demand for professionals who combine digital and domain expertise. Gerhart, Rastegari, and Cole (2024) emphasize that communicating and applying insights are as vital as technical proficiency. Together, these studies illustrate that the analytics professionals best positioned for long-term success are those with adaptable, interdisciplinary, and AI-literate skill sets in a rapidly evolving job market.
Evidence from the U.S. Bureau of Labor Statistics suggests strong growth in data science careers, with increasing demand for professionals who understand both technical analytics and business strategy (U.S. Bureau of Labor Statistics 2024). Graham (2025) highlights emerging AI skills in technology sectors, demonstrating how new competencies rapidly become market requirements. Y. Liu and Li (2024) provide insights into talent cultivation in big data management, showing that employers seek candidates with 1-3 years of experience and strong foundational skills in data mining and analysis.
The evidence suggests that the analytics field is expanding rather than contracting. Organizations are not replacing human analysts with AI systems; instead, they are expanding their teams and seeking professionals who can work alongside AI tools. This presents a compelling opportunity for students entering the field, provided they develop both technical depth and the soft skills necessary to translate data insights into business value.
7 Project Scope & Methodology
7.1 Data Source
We will utilize the Lightcast 2024 dataset, which includes:
Job posting volumes for analytics, data science, and ML roles by industry and location
Salary data across data-related positions and geographies
Skill requirements extracted from job descriptions
Hiring trends across time periods
Company size and industry classifications
Emerging role titles and job requirements
7.2 Analysis Approach
Our team will:
Clean and preprocess the Lightcast data using Python (pandas, NumPy)
Extract and categorize skills mentioned in job descriptions for analytics and ML roles
Compare trends across industries, geographies, and job levels
Analyze salary patterns to understand compensation for different skill combinations
Identify emerging roles and how job requirements are evolving
Visualize findings through interactive dashboards using Plotly and Matplotlib
Develop career strategy recommendations based on market insights
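As a concrete illustration of the skill-extraction and categorization step above, the sketch below tallies pipe-delimited skill strings the way the later analysis does. The tiny inline DataFrame is a hypothetical stand-in for the Lightcast file; the column name `SKILLS_NAME` follows the schema used elsewhere in this report.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the Lightcast postings file
sample = pd.DataFrame({
    "TITLE_NAME": ["Data Analyst", "Data Scientist", "Data Analyst"],
    "SKILLS_NAME": ["Python | SQL", "Python | Machine Learning", "SQL | Tableau"],
})

def count_skills(series: pd.Series) -> Counter:
    """Split pipe-delimited skill strings and tally each normalized skill."""
    counts: Counter = Counter()
    for text in series.dropna():
        counts.update(s.strip().lower() for s in str(text).split("|") if s.strip())
    return counts

skill_counts = count_skills(sample["SKILLS_NAME"])
print(skill_counts.most_common(3))
```

The same `value_counts`-style tally then feeds the trend comparisons and visualizations described in the remaining steps.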
7.3 Expected Findings
We anticipate discovering that:
Python, SQL, and cloud platforms (AWS, GCP, Azure) remain foundational skills
Generative AI and prompt engineering are emerging as newly valued competencies
Finance, healthcare, and technology sectors lead in analytics hiring
Soft skills like communication and domain expertise are increasingly emphasized
Analytics roles offer strong job security and career growth potential
Salary premiums exist for ML/AI-specialized roles compared to traditional business analytics
7.4 Deliverables (Future Phases)
Exploratory Data Analysis (EDA) with visualizations
Interactive dashboards showing skill trends, industry hiring patterns, and salary insights
Career pathway recommendations for different specializations
Personal career action plans for each team member
8 References
9 Data Cleaning and Preparation
9.1 Introduction
This document details the comprehensive data cleaning and preprocessing steps applied to the 2024 job market dataset.
```python
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Load raw data
df = pd.read_csv('lightcast_job_postings.csv')
print(f"Initial dataset shape: {df.shape}")
print(f"Total rows: {df.shape[0]:,}")
print(f"Total columns: {df.shape[1]}")
```
Initial dataset shape: (72498, 131)
Total rows: 72,498
Total columns: 131
9.2 Step 1: Removing Redundant Columns
We remove redundant columns to improve dataset quality and analysis efficiency.
```python
# List of columns to drop
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "BODY", "TITLE_RAW", "COMPANY_RAW", "ACTIVE_SOURCES_INFO",
    "NAICS2", "NAICS2_NAME", "NAICS3", "NAICS3_NAME",
    "NAICS4", "NAICS4_NAME", "NAICS5", "NAICS5_NAME", "NAICS6", "NAICS6_NAME",
    "SOC_2", "SOC_2_NAME", "SOC_3", "SOC_3_NAME",
    "SOC_4", "SOC_4_NAME", "SOC_5", "SOC_5_NAME",
    "SOC_2021_2", "SOC_2021_2_NAME", "SOC_2021_3", "SOC_2021_3_NAME",
    "SOC_2021_5", "SOC_2021_5_NAME",
    "LOT_CAREER_AREA", "LOT_CAREER_AREA_NAME", "LOT_OCCUPATION", "LOT_OCCUPATION_NAME",
    "LOT_SPECIALIZED_OCCUPATION", "LOT_SPECIALIZED_OCCUPATION_NAME",
    "LOT_OCCUPATION_GROUP", "LOT_OCCUPATION_GROUP_NAME",
    "LOT_V6_SPECIALIZED_OCCUPATION", "LOT_V6_SPECIALIZED_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION", "LOT_V6_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION_GROUP", "LOT_V6_OCCUPATION_GROUP_NAME",
    "LOT_V6_CAREER_AREA", "LOT_V6_CAREER_AREA_NAME",
    "ONET_2019", "ONET_2019_NAME", "CIP6", "CIP6_NAME", "CIP2", "CIP2_NAME",
    "COUNTY", "COUNTY_NAME", "COUNTY_OUTGOING", "COUNTY_NAME_OUTGOING",
    "COUNTY_INCOMING", "COUNTY_NAME_INCOMING",
    "MSA", "MSA_OUTGOING", "MSA_INCOMING",
    "SALARY_TO", "SALARY_FROM", "ORIGINAL_PAY_PERIOD",
    "MODELED_EXPIRED", "MODELED_DURATION",
]

# Drop only columns that exist
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df_cleaned = df.drop(columns=columns_to_drop, inplace=False)
print(f"Columns removed: {len(columns_to_drop)}")
print(f"New dataset shape: {df_cleaned.shape}")
```
Columns removed: 69
New dataset shape: (72498, 62)
9.3 Step 2: Handling Missing Values
```python
# Missing value statistics
missing_stats = pd.DataFrame({
    'Column': df_cleaned.columns,
    'Missing_Count': df_cleaned.isnull().sum().values,
    'Missing_Percentage': (df_cleaned.isnull().sum().values / len(df_cleaned) * 100).round(2),
})
missing_stats = missing_stats[missing_stats['Missing_Count'] > 0].sort_values(
    'Missing_Percentage', ascending=False)
print(f"Columns with missing values: {len(missing_stats)}")
print("\nTop 10 columns with highest missing percentages:")
print(missing_stats.head(10))
```
```python
# Drop columns with >50% missing values
threshold = len(df_cleaned) * 0.5
cols_before = len(df_cleaned.columns)
df_cleaned = df_cleaned.dropna(thresh=threshold, axis=1)
cols_after = len(df_cleaned.columns)
print(f"Columns dropped due to >50% missing values: {cols_before - cols_after}")

# Fill numerical columns with median
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
for col in numerical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        median_val = df_cleaned[col].median()
        df_cleaned[col].fillna(median_val, inplace=True)

# Fill categorical columns with "Unknown"
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col].fillna("Unknown", inplace=True)

print(f"Total missing values after imputation: {df_cleaned.isnull().sum().sum()}")
```
Columns dropped due to >50% missing values: 6
Total missing values after imputation: 0
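The same median/"Unknown" strategy can be sanity-checked in isolation. The sketch below applies it to a toy frame with illustrative column names (not from the Lightcast schema) and asserts that no nulls survive:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative columns; mirrors the imputation strategy above
df = pd.DataFrame({
    "salary": [50000.0, np.nan, 70000.0],
    "state": ["MA", None, "CA"],
})

# Median for numeric columns, "Unknown" for categorical columns
for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].fillna("Unknown")

assert df.isnull().sum().sum() == 0  # imputation left no gaps
print(df)
```

Assigning the result back with `df[col] = df[col].fillna(...)` rather than `inplace=True` also avoids pandas chained-assignment warnings.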
This document presents a comprehensive exploratory data analysis of the 2024 job market. We examine hiring trends, employment types, geographic patterns, and remote work opportunities to provide actionable insights for job seekers and market analysts.
```python
import pandas as pd
import numpy as np
import hvplot.pandas
import panel as pn
import warnings

warnings.filterwarnings('ignore')

# Enable panel for rendering
pn.extension()

# Updated color palette - blue, pink, purple, red
COLORS = ['#2196F3', '#E91E63', '#9B59B6', '#F44336',
          '#00BCD4', '#FF4081', '#7E57C2', '#EF5350']

# Load cleaned data
df = pd.read_csv('cleanedjob_postings.csv')
#print(f"Dataset shape: {df.shape}")
#print(f"Analysis period: 2024 Job Market")
#print(f"Total job postings analyzed: {len(df):,}")
```
10.2 1. Job Title Analysis
10.2.1 1.1 Top In-Demand Job Titles
Understanding which job titles dominate the market helps identify high-demand roles and emerging career opportunities.
```python
# Get top 10 job titles
title_counts = df[df['TITLE_NAME'].notna()]['TITLE_NAME'].value_counts().head(10)
title_df = pd.DataFrame({
    'Job Title': title_counts.index,
    'Count': title_counts.values,
    'Percentage': (title_counts.values / len(df[df['TITLE_NAME'].notna()]) * 100).round(2),
})
print(f"\nTop job titles : {len(title_counts)} titles")

# Create interactive horizontal bar chart
title_df.hvplot.barh(
    x='Job Title', y='Count',
    title='Top 10 Most In-Demand Job Titles (2024)',
    height=500, width=800, color='#2196F3',
    hover_cols=['Percentage'],
    ylabel='', xlabel='Number of Job Postings',
    flip_yaxis=True,
).opts(xformatter='%.0f')
```
Top job titles : 10 titles
The Top 10 most in-demand jobs are dominated by Data Analyst roles, which appear far more frequently than any other title. Business Intelligence Analysts, Enterprise Architects, and Data Modelers also rank highly, highlighting strong employer demand for data-focused and technical architecture skills. Overall, the top roles show a clear market emphasis on analytics, data management, and solution-oriented positions in 2024.
10.3 2. Employment Type Distribution
10.3.1 2.1 Full-Time vs Part-Time vs Contract
Understanding employment type distribution helps job seekers target positions matching their career preferences.
```python
# Count jobs by employment type
df_employment = df[df['EMPLOYMENT_TYPE_NAME'].notna()].copy()
# Remove "Unknown" values
df_employment = df_employment[df_employment['EMPLOYMENT_TYPE_NAME'] != 'Unknown']
employment_counts = df_employment['EMPLOYMENT_TYPE_NAME'].value_counts()
employment_df = pd.DataFrame({
    'Employment Type': employment_counts.index,
    'Count': employment_counts.values,
    'Percentage': (employment_counts.values / employment_counts.sum() * 100).round(1),
})
#print(f"\nEmployment types : {len(df_employment):,} jobs")
#print(employment_df.to_string(index=False))

# Create bar chart
employment_df.hvplot.bar(
    x='Employment Type', y='Count',
    title='Job Market Distribution by Employment Type (2024)',
    height=400, width=700,
    color='Employment Type', cmap=COLORS,
    hover_cols=['Percentage'],
    ylabel='Number of Job Postings', rot=45,
).opts(yformatter='%.0f')
```
Full-time jobs dominate, making up over 95% of all postings; only a small share of roles are part-time (3.3%) or mixed part-time/full-time (1.4%). This shows that employers mainly prefer hiring for stable, full-time positions.
10.4 3. Remote Work Analysis
10.4.1 3.1 Remote vs Hybrid vs On-Site
The prevalence of remote work reflects post-pandemic hiring trends and employer flexibility.
```python
# Count jobs by remote type and remove not-specified values
df_remote = df.copy()
# Replace None, NaN, Unknown, empty strings with NaN
df_remote['REMOTE_TYPE_NAME'] = df_remote['REMOTE_TYPE_NAME'].replace({
    '[None]': np.nan,
    'None': np.nan,
    'Unknown': np.nan,
    '': np.nan,
    'Not Specified': np.nan,
})
# Drop rows with NaN in REMOTE_TYPE_NAME
df_remote = df_remote[df_remote['REMOTE_TYPE_NAME'].notna()]
#print(f"\nRemote type distribution (excluding not specified):")
#print(f"Total jobs with remote info: {len(df_remote):,}")

# Count jobs by remote type
remote_counts = df_remote['REMOTE_TYPE_NAME'].value_counts()
remote_df = pd.DataFrame({
    'Remote Type': remote_counts.index,
    'Count': remote_counts.values,
    'Percentage': (remote_counts.values / remote_counts.sum() * 100).round(1),
})
#print(remote_df.to_string(index=False))

# Create bar chart with custom colors
remote_df.hvplot.bar(
    x='Remote Type', y='Count',
    title='Job Market Distribution: Remote, Hybrid & On-Site Opportunities (2024)',
    height=400, width=700,
    color='Remote Type', cmap=['#2196F3', '#E91E63', '#9B59B6', '#F44336'],
    hover_cols=['Percentage'],
    ylabel='Number of Job Postings', xlabel='Work Location Type', rot=0,
).opts(yformatter='%.0f')
```
The dominance of on-site, full-time roles in 2024 reflects a strong post-COVID recovery, as companies shift back toward stable, long-term hiring after years of uncertainty. The comparatively small share of remote and hybrid postings suggests that businesses are prioritizing consistent workforce availability to meet rising operational demands. Overall, the trend indicates increased employer confidence and a return to pre-pandemic hiring patterns focused on in-person, full-time talent.
10.5 4. Geographic Analysis
10.5.1 4.1 Top States for Job Opportunities
Geographic distribution shows where job opportunities are concentrated.
```python
# Get top 10 states by job postings
df_states = df[df['STATE_NAME'].notna()].copy()
state_counts = df_states['STATE_NAME'].value_counts().head(10)
state_df = pd.DataFrame({
    'State': state_counts.index,
    'Job Postings': state_counts.values,
    'Percentage': (state_counts.values / len(df_states) * 100).round(2),
})
#print(f"\nStates with job postings : {len(df_states):,} jobs")

# Create bar chart
state_df.hvplot.bar(
    x='State', y='Job Postings',
    title='Top 10 States by Number of Job Postings (2024)',
    height=450, width=850, color='#9B59B6',
    hover_cols=['Percentage'],
    ylabel='Number of Job Postings', xlabel='State', rot=45,
).opts(yformatter='%.0f')
```
The top 10 states for job postings in 2024 are led by Texas and California, which together dominate the job market with substantially higher opportunities than the rest. States like Virginia, Florida, New York, and Illinois follow, reflecting strong demand in both tech-heavy and fast-growing regional economies. Overall, job opportunities are concentrated in major economic hubs and states with strong corporate, technology, and government sectors.
10.6 5. Top Hiring Companies
10.6.1 5.1 Companies with Most Job Openings
Identifying top hiring companies helps job seekers target organizations with multiple opportunities.
```python
# Get top 10 companies by job postings (remove None values)
df_companies = df[df['COMPANY_NAME'].notna()].copy()
company_counts = df_companies['COMPANY_NAME'].value_counts().head(10)
company_df = pd.DataFrame({
    'Company': company_counts.index,
    'Job Postings': company_counts.values,
})
#print(f"\nCompanies with job postings : {len(df_companies):,} jobs")

# Create horizontal bar chart
company_df.hvplot.barh(
    x='Company', y='Job Postings',
    title='Top 10 Companies by Number of Job Postings (2024)',
    height=550, width=800, color='#E91E63',
    ylabel='', xlabel='Number of Job Postings',
    flip_yaxis=True,
).opts(xformatter='%.0f')
```
The top 10 hiring companies in 2024 are led by Deloitte and Accenture, which show significantly higher job postings than other firms, reflecting strong demand for consulting, technology, and analytics roles. Companies like PwC, Insight Global, KPMG, and Lumen Technologies also appear prominently, indicating consistent hiring across both the consulting and IT services sectors. Overall, the hiring landscape is dominated by large professional services firms, showcasing continued growth in advisory, digital transformation, and data-driven business roles.
10.7 6. Job Posting Timeline
10.7.1 6.1 When Jobs Are Posted
Understanding posting patterns helps with strategic job search timing. Since job postings peaked in August and September 2024, job seekers should target this late-summer period for applications. Companies appear to increase hiring toward the end of Q3, meaning more openings, faster responses, and higher chances of landing interviews. Preparing resumes, portfolios, and applications ahead of this surge (during May–July) can give job seekers a strategic advantage when the market becomes most active.
```python
# Print diagnostics
#print("\nAnalyzing POSTED column:")
#print(f"Sample values: {df['POSTED'].head(10)}")
#print(f"Data type: {df['POSTED'].dtype}")

# Convert POSTED to datetime
df_time = df.copy()
df_time['POSTED'] = pd.to_datetime(df_time['POSTED'], errors='coerce')
#print(f"\nAfter conversion:")
#print(f"Date range: {df_time['POSTED'].min()} to {df_time['POSTED'].max()}")
#print(f"Valid dates: {df_time['POSTED'].notna().sum():,} out of {len(df_time):,}")
#print(f"Missing dates: {df_time['POSTED'].isna().sum():,}")

# Filter valid dates
time_data = df_time[df_time['POSTED'].notna()].copy()

if len(time_data) > 0:
    # Extract year-month
    time_data['YearMonth'] = time_data['POSTED'].dt.to_period('M')
    # Group by month and count
    monthly_counts = time_data.groupby('YearMonth').size().reset_index(name='Job Postings')
    monthly_counts['Month'] = monthly_counts['YearMonth'].astype(str)
    # Sort by month
    monthly_counts = monthly_counts.sort_values('Month')
    #print(f"\nMonthly posting counts:")
    #print(monthly_counts[['Month', 'Job Postings']].to_string(index=False))

    # Create line and area chart
    line_chart = monthly_counts.hvplot.line(
        x='Month', y='Job Postings',
        title='Job Posting Trends Over Time (2024)',
        height=400, width=850, color='#2196F3', line_width=3,
        ylabel='Number of Job Postings', xlabel='Month', rot=45,
    ).opts(yformatter='%.0f')
    area_chart = monthly_counts.hvplot.area(
        x='Month', y='Job Postings', alpha=0.3, color='#2196F3',
    )
    # Combine line and area
    (line_chart * area_chart)
else:
    #print("\nWarning: No valid dates found in POSTED column")
    #print("Sample of POSTED values:")
    print(df['POSTED'].head(20))
```
```python
# Analyze minimum years of experience (remove None values)
df_experience = df[df['MIN_YEARS_EXPERIENCE'].notna()].copy()
exp_counts = df_experience['MIN_YEARS_EXPERIENCE'].value_counts().sort_index()
exp_df = pd.DataFrame({
    'Years of Experience': exp_counts.index,
    'Job Postings': exp_counts.values,
})
# Convert Years of Experience to string for better labels
exp_df['Years Label'] = exp_df['Years of Experience'].astype(int).astype(str) + ' years'
#print(f"\nJobs with experience requirements (excluding None): {len(df_experience):,} jobs")
#print(exp_df[['Years of Experience', 'Job Postings']].to_string(index=False))

# Create bar chart with pastel purple color
exp_df.hvplot.bar(
    x='Years Label', y='Job Postings',
    title='Job Postings by Minimum Years of Experience Required (2024)',
    height=400, width=800, color='#B39DDB',
    ylabel='Number of Job Postings', xlabel='Minimum Years of Experience', rot=45,
).opts(yformatter='%.0f')
```
Most job postings in 2024 require around 5 years of experience, making it the dominant requirement across roles. Entry-level opportunities (0–2 years) exist but are significantly fewer, showing that employers prioritize mid-level talent with proven skills. Requirements beyond 7+ years drop sharply, indicating limited demand for highly senior roles compared to mid-level positions.
10.9 References
Data source: Lightcast Job Postings Dataset (2024)
Visualization tools: hvPlot, Panel, Python
Analysis framework: Standard EDA best practices
11 Data Analysis Methods
11.1 Introduction
This document provides a comprehensive analysis of job market data for 2024. We perform data cleaning, handle missing values, remove duplicates, and conduct exploratory data analysis to understand hiring trends, salary distributions, and job market dynamics.
```python
import pandas as pd
import numpy as np
import hvplot.pandas
import holoviews as hv
import warnings

warnings.filterwarnings('ignore')

# Initialize hvplot with bokeh backend
hv.extension('bokeh')

# Load your data
df = pd.read_csv('lightcast_job_postings.csv')
```
11.2 Step 1: Removing Redundant Columns
11.2.1 Why Remove These Columns?
We remove redundant columns to improve dataset quality and analysis efficiency. Specifically:
Tracking & Administrative Columns: ID, URL, DUPLICATES, LAST_UPDATED_TIMESTAMP - These are metadata and don’t contribute to analysis
Raw Text Fields: BODY, TITLE_RAW, COMPANY_RAW - We have cleaned versions (TITLE, COMPANY_NAME, TITLE_CLEAN)
Deprecated Classifications: We keep only the latest NAICS_2022_6, SOC_2021_4, and CIP4 standards
Duplicate Geographic Fields: Multiple versions of county/MSA data create redundancy
```python
# List of columns to drop
columns_to_drop = [
    # Administrative & tracking columns
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    # Raw text columns (we have cleaned versions)
    "BODY", "TITLE_RAW", "COMPANY_RAW", "ACTIVE_SOURCES_INFO",
    # Deprecated NAICS versions (keeping NAICS_2022_6)
    "NAICS2", "NAICS2_NAME", "NAICS3", "NAICS3_NAME",
    "NAICS4", "NAICS4_NAME", "NAICS5", "NAICS5_NAME", "NAICS6", "NAICS6_NAME",
    # Deprecated SOC versions (keeping SOC_2021_4)
    "SOC_2", "SOC_2_NAME", "SOC_3", "SOC_3_NAME",
    "SOC_4", "SOC_4_NAME", "SOC_5", "SOC_5_NAME",
    "SOC_2021_2", "SOC_2021_2_NAME", "SOC_2021_3", "SOC_2021_3_NAME",
    "SOC_2021_5", "SOC_2021_5_NAME",
    # Deprecated occupation classifications
    "LOT_CAREER_AREA", "LOT_CAREER_AREA_NAME", "LOT_OCCUPATION", "LOT_OCCUPATION_NAME",
    "LOT_SPECIALIZED_OCCUPATION", "LOT_SPECIALIZED_OCCUPATION_NAME",
    "LOT_OCCUPATION_GROUP", "LOT_OCCUPATION_GROUP_NAME",
    "LOT_V6_SPECIALIZED_OCCUPATION", "LOT_V6_SPECIALIZED_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION", "LOT_V6_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION_GROUP", "LOT_V6_OCCUPATION_GROUP_NAME",
    "LOT_V6_CAREER_AREA", "LOT_V6_CAREER_AREA_NAME",
    # Deprecated CIP and ONET versions
    "ONET_2019", "ONET_2019_NAME", "CIP6", "CIP6_NAME", "CIP2", "CIP2_NAME",
    # Duplicate geographic fields
    "COUNTY", "COUNTY_NAME", "COUNTY_OUTGOING", "COUNTY_NAME_OUTGOING",
    "COUNTY_INCOMING", "COUNTY_NAME_INCOMING",
    "MSA", "MSA_OUTGOING", "MSA_INCOMING",
    # Deprecated salary fields (keeping SALARY)
    "SALARY_TO", "SALARY_FROM", "ORIGINAL_PAY_PERIOD",
    # Model versions (keep actual data)
    "MODELED_EXPIRED", "MODELED_DURATION",
]

# Drop only columns that exist in the dataset
columns_to_drop = [col for col in columns_to_drop if col in df.columns]
df_cleaned = df.drop(columns=columns_to_drop, inplace=False)
```
11.3 Step 2: Handling Missing Values
11.3.1 Understanding Missing Data
Before imputation, let’s visualize where data is missing:
We applied a strategic approach to handle missing data:
Dropped columns with more than 50% missing values
Filled numerical columns with median values to maintain distribution
Filled categorical columns with “Unknown” for clarity
```python
# Strategy: Drop columns with >50% missing values
threshold = len(df_cleaned) * 0.5
cols_before = len(df_cleaned.columns)
df_cleaned = df_cleaned.dropna(thresh=threshold, axis=1)
cols_after = len(df_cleaned.columns)

# Fill numerical columns with median
numerical_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
for col in numerical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        median_val = df_cleaned[col].median()
        df_cleaned[col].fillna(median_val, inplace=True)

# Fill categorical columns with "Unknown"
categorical_cols = df_cleaned.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    if df_cleaned[col].isnull().sum() > 0:
        df_cleaned[col].fillna("Unknown", inplace=True)
```
11.4 Step 3: Removing Duplicates
11.4.1 Why Remove Duplicates?
Duplicate job postings skew our analysis. We remove them based on job title, company, and location to ensure each unique position is counted only once.
```python
# Remove duplicates based on key identifiers
df_cleaned = df_cleaned.drop_duplicates(
    subset=['TITLE_NAME', 'COMPANY_NAME', 'LOCATION'],
    keep='first',
)
```
11.5 Step 4: Exploratory Data Analysis (EDA)
11.5.1 Visualization 1: Top Job Titles
Understanding which job titles are most in demand helps job seekers identify the skills and roles to target.
```python
# Get top 10 job titles
title_counts = df_cleaned['TITLE_NAME'].value_counts().head(10)
title_df = pd.DataFrame({'Job Title': title_counts.index, 'Count': title_counts.values})

# Create interactive plot
title_plot = title_df.hvplot.barh(
    x='Job Title', y='Count',
    title='Top 10 Job Titles by Postings (2024)',
    xlabel='Number of Job Postings', ylabel='Job Title',
    height=500, width=1000, color='#3498db',
    hover_cols='all',
).opts(invert_yaxis=True)
title_plot
```
Note: Key Insight
The top job titles show where the job market is most active. This data helps identify roles with highest demand for new talent.
11.5.2 Visualization 2: Job Postings by Employment Type
Employment type indicates the nature of positions available in the job market. Understanding the distribution helps job seekers target positions that match their preferences.
```python
# Count jobs by employment type
employment_counts = df_cleaned['EMPLOYMENT_TYPE_NAME'].value_counts()
employment_df = pd.DataFrame({
    'Employment Type': employment_counts.index,
    'Count': employment_counts.values,
    'Percentage': (employment_counts.values / employment_counts.sum() * 100).round(1),
})

# Create interactive scatter plot, sizing points by posting count
# (hvplot's bokeh scatter has no `size_max` option, so scale the size column directly)
employment_df['Size'] = (employment_df['Count'] / employment_df['Count'].max() * 50).round(1)
employment_plot = employment_df.hvplot.scatter(
    x='Employment Type', y='Percentage', size='Size',
    title='Job Market Distribution by Employment Type (2024)',
    xlabel='Employment Type', ylabel='Percentage of Total Jobs (%)',
    height=500, width=1000, color='#e74c3c',
    hover_cols=['Count', 'Percentage'], alpha=0.6,
)
employment_plot
```
Note: Key Insight
The distribution of employment types shows whether the market is primarily offering full-time, part-time, or contract positions, helping job seekers identify opportunities that match their work preferences.
11.5.3 Visualization 3: Remote Work Distribution
The prevalence of remote work reflects post-pandemic hiring trends and work flexibility.
```python
# Count jobs by remote type
remote_counts = df_cleaned['REMOTE_TYPE_NAME'].value_counts()
remote_df = pd.DataFrame({
    'Remote Type': remote_counts.index,
    'Count': remote_counts.values,
    'Percentage': (remote_counts.values / remote_counts.sum() * 100).round(1),
})

# Create interactive bar chart
remote_plot = remote_df.hvplot.bar(
    x='Remote Type', y='Count',
    title='Job Market Distribution: Remote vs. On-Site (2024)',
    xlabel='Remote Type', ylabel='Number of Job Postings',
    height=500, width=900, color='#2E8B57',
    hover_cols=['Count', 'Percentage'],
)
remote_plot
```
Note: Key Insight
The balance between remote and on-site jobs shows employers’ flexibility preferences. Higher remote job percentages indicate industry acceptance of distributed workforces.
11.5.4 Visualization 4: Top Companies Hiring
Understanding which companies are actively hiring helps job seekers identify potential employers with multiple open positions.
```python
# Get top 10 companies by job postings
company_counts = df_cleaned['COMPANY_NAME'].value_counts().head(10)
company_df = pd.DataFrame({
    'Company': company_counts.index,
    'Job Postings': company_counts.values,
})

# Create area chart
company_plot = company_df.hvplot.area(
    x='Company', y='Job Postings',
    title='Top 10 Companies by Number of Job Postings (2024)',
    xlabel='Company Name', ylabel='Number of Job Postings',
    height=500, width=1000, color='#9b59b6',
    line_width=2, alpha=0.5, hover_cols='all',
).opts(xrotation=45)
company_plot
```
Note: Key Insight
Companies with the highest number of job postings indicate organizations that are actively expanding their workforce and may offer more opportunities for candidates.
11.6 Summary of Data Cleaning & Analysis
Important: Summary
The data cleaning process successfully prepared the job market dataset for analysis. We removed redundant administrative and deprecated classification columns, handled missing values through strategic imputation, and eliminated duplicate job postings.
The exploratory analysis revealed key insights into job market trends, including:
High-demand roles: Identification of the most sought-after job titles in 2024
Employment stability: Understanding the distribution of full-time, part-time, and contract positions
Workplace flexibility: Analysis of remote, hybrid, and on-site work opportunities
Salary trends: Clear understanding of compensation ranges across different job categories
These insights provide valuable guidance for job seekers, employers, and market analysts in understanding the current state of the 2024 job market.
11.7 References
All data sourced from Lightcast Job Postings Dataset (2024). Analysis performed using Python with pandas, hvplot, and holoviews libraries.
This section applies Natural Language Processing (NLP) techniques to analyze job postings and extract meaningful insights about skills, requirements, and industry trends. By processing structured skills data, we can uncover patterns about the most in-demand competencies.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Set seaborn style
sns.set_style("whitegrid")
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (8, 5)

# Load data
df = pd.read_csv('cleanedjob_postings.csv')
print(f"Dataset loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")

# Identify skills columns
skills_columns = [col for col in df.columns if 'SKILLS_NAME' in col or 'SKILLS' in col]
print(f"Skills columns found: {skills_columns}")
# Combine all skills into a single text field for analysis
def combine_skills(row):
    """Combine all skills columns into a single text"""
    skills = []
    for col in ['SKILLS_NAME', 'SPECIALIZED_SKILLS_NAME', 'COMMON_SKILLS_NAME', 'SOFTWARE_SKILLS_NAME']:
        if col in df.columns and pd.notna(row.get(col)):
            skills.append(str(row[col]))
    return ' | '.join(skills)

df['all_skills'] = df.apply(combine_skills, axis=1)
df_skills = df[df['all_skills'].str.len() > 0].copy()
print(f"✓ Successfully processed: {len(df_skills):,} job postings with skills")
print(f"✓ Average skills text length: {df_skills['all_skills'].str.len().mean():.0f} characters")
✓ Successfully processed: 59,220 job postings with skills
✓ Average skills text length: 1204 characters
12.3 2. Keyword Extraction
12.3.1 2.1 Top 10 Most Common Skills
# Extract most common skill keywords (excluding the extreme outlier)
if len(df_skills) > 0:
    all_skills = []
    for skills_text in df_skills['all_skills']:
        # Split by pipe separator and clean thoroughly
        skills = [s.strip() for s in str(skills_text).split('|')]
        # Remove quotes, brackets, and clean each skill
        cleaned_skills = []
        for skill in skills:
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:  # Only keep skills with more than 2 characters
                cleaned_skills.append(clean)
        all_skills.extend(cleaned_skills)

    # Remove empty strings
    all_skills = [s for s in all_skills if len(s) > 0]

    if len(all_skills) > 0:
        skill_freq = Counter(all_skills)
        # Get top 11 skills, then exclude the top one if it's an extreme outlier
        top_skills_raw = skill_freq.most_common(11)
        # Check if the top skill is an extreme outlier (>10x the second)
        if len(top_skills_raw) > 1 and top_skills_raw[0][1] > 10 * top_skills_raw[1][1]:
            print(f"Note: Excluded extreme outlier '{top_skills_raw[0][0]}' with {top_skills_raw[0][1]:,} occurrences")
            top_skills = top_skills_raw[1:11]  # Skip the first, take next 10
        else:
            top_skills = top_skills_raw[:10]

        skills, counts = zip(*top_skills)

        # Create dataframe for seaborn
        skill_df = pd.DataFrame({'Skill': skills, 'Frequency': counts})
        skill_df = skill_df.sort_values('Frequency')

        # Clean up skill names - remove quotes and brackets, capitalize properly
        skill_df['Skill'] = skill_df['Skill'].str.replace('"', '').str.replace("'", "")
        skill_df['Skill'] = skill_df['Skill'].str.replace(r'[\[\]\(\)]', '', regex=True)
        skill_df['Skill'] = skill_df['Skill'].str.strip()
        # Split combined labels - if comma-separated, take first item only
        skill_df['Skill'] = skill_df['Skill'].apply(lambda x: x.split(',')[0].strip() if ',' in x else x)
        skill_df['Skill'] = skill_df['Skill'].str.title()

        fig, ax = plt.subplots(figsize=(8, 5))
        bars = sns.barplot(data=skill_df, y='Skill', x='Frequency', palette='Blues_r', ax=ax)
        ax.set_title('Top 10 Most Common Skills in Job Postings', fontsize=14, fontweight='bold', pad=15)
        ax.set_xlabel('Frequency', fontsize=11, fontweight='bold')
        ax.set_ylabel('Skill', fontsize=11, fontweight='bold')

        # Add value labels
        for i, v in enumerate(skill_df['Frequency']):
            ax.text(v + max(skill_df['Frequency']) * 0.02, i, f'{int(v):,}',
                    va='center', fontweight='bold', fontsize=9)

        # Improve layout
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        plt.tight_layout()
        plt.show()
else:
    print("No skills found")
Top 10 Skills Insight:
SAP Applications dominates with 1,114 postings, reflecting the strong demand for enterprise resource planning expertise in large organizations. Oracle Cloud (667) and Microsoft Office (637) follow, showing that cloud platforms and productivity tools remain essential. Communication (532) appears as the top soft skill, emphasizing that technical roles require strong interpersonal abilities. The data reveals a mix of enterprise software (SAP, Oracle Cloud), business analysis capabilities (Data Analysis, Dashboard, Project Management), and emerging areas like Cyber Security and UX Design. Job seekers should prioritize SAP expertise if targeting enterprise roles, while building a foundation in Microsoft Office, data analysis, and communication skills for broader market appeal.
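To make counts like these comparable across skill categories, each count can be turned into a share of postings. A minimal sketch of the same Counter-based tally on a tiny, purely illustrative set of pipe-separated skills strings (the real values come from the cleaned `all_skills` column):

```python
from collections import Counter

# Hypothetical mini-corpus mimicking the cleaned `all_skills` column
postings = [
    "sap applications | communication | data analysis",
    "oracle cloud | data analysis",
    "sap applications | microsoft office",
]

# Same Counter-based tally the analysis uses
freq = Counter(s.strip() for text in postings for s in text.split('|'))

# Share of postings mentioning each skill (valid here because each skill
# appears at most once per posting in this sketch)
share = {skill: count / len(postings) for skill, count in freq.items()}
print(freq.most_common(2))
print(round(share['sap applications'], 2))  # → 0.67
```

On the full dataset, dividing by the 59,220 postings with skills gives each skill's market penetration rather than a raw count.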
12.4 3. Technical Skills Analysis
12.4.1 3.1 Top 10 Software & Technical Skills
# Focus on software/technical skills column
if 'SOFTWARE_SKILLS_NAME' in df.columns:
    software_skills = []
    for skills_text in df['SOFTWARE_SKILLS_NAME'].dropna():
        skills = [s.strip() for s in str(skills_text).split('|')]
        # Clean each skill thoroughly
        for skill in skills:
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:
                software_skills.append(clean)

    software_skills = [s for s in software_skills if len(s) > 0]

    if len(software_skills) > 0:
        software_freq = Counter(software_skills)
        # Get top 11, check for outlier
        top_software_raw = software_freq.most_common(11)
        if len(top_software_raw) > 1 and top_software_raw[0][1] > 10 * top_software_raw[1][1]:
            print(f"Note: Excluded extreme outlier '{top_software_raw[0][0]}' with {top_software_raw[0][1]:,} occurrences")
            top_software = top_software_raw[1:11]
        else:
            top_software = top_software_raw[:10]

        software, counts = zip(*top_software)
        software_df = pd.DataFrame({'Software': software, 'Count': counts})
        software_df = software_df.sort_values('Count')

        # Clean software names - remove quotes, brackets, parentheses
        software_df['Software'] = software_df['Software'].str.replace('"', '').str.replace("'", "")
        software_df['Software'] = software_df['Software'].str.replace(r'[\[\]\(\)]', '', regex=True)
        software_df['Software'] = software_df['Software'].str.strip()
        # Split combined labels - if comma-separated, take first item only
        software_df['Software'] = software_df['Software'].apply(lambda x: x.split(',')[0].strip() if ',' in x else x)
        software_df['Software'] = software_df['Software'].str.title()

        fig, ax = plt.subplots(figsize=(8, 5))
        sns.barplot(data=software_df, y='Software', x='Count', palette='viridis_r', ax=ax)
        ax.set_title('Top 10 Software Skills in Job Postings', fontsize=14, fontweight='bold', pad=15)
        ax.set_xlabel('Number of Job Postings', fontsize=11, fontweight='bold')
        ax.set_ylabel('Software Skill', fontsize=11, fontweight='bold')

        # Add value labels at the end of each bar
        max_val = software_df['Count'].max()
        for i, v in enumerate(software_df['Count']):
            ax.text(v + max_val * 0.01, i, f'{int(v):,}', va='center', fontweight='bold', fontsize=9)

        # Set x-axis limit to prevent cutoff
        ax.set_xlim(0, max_val * 1.15)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        plt.tight_layout()
        plt.show()
Top 10 Software Skills - Job Seeker Insights:
SAP Applications leads dramatically with 934 postings, making it the single most valuable software skill for enterprise-focused careers. Oracle Cloud (633) and Microsoft Office (602) demonstrate the importance of cloud infrastructure and productivity suites. Dashboard skills (433) reflect the growing need for data visualization capabilities across all roles. SQL Programming Language (353) and Microsoft Excel (282) remain fundamental for data manipulation and analysis. The presence of specialized tools like Anaplan (196), Onestream CPM Software (196), and Oracle E-Business Suite (173) indicates niche opportunities in financial planning and enterprise systems. Job seekers should master SAP for enterprise roles, Excel and SQL for data work, and consider specializing in emerging tools like Anaplan for competitive advantage.
12.5 4. Specialized Skills Analysis
12.5.1 4.1 Top 10 Specialized Skills
# Focus on specialized skills column
if 'SPECIALIZED_SKILLS_NAME' in df.columns:
    specialized_skills = []
    for skills_text in df['SPECIALIZED_SKILLS_NAME'].dropna():
        skills = [s.strip() for s in str(skills_text).split('|')]
        # Clean each skill thoroughly
        for skill in skills:
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:
                specialized_skills.append(clean)

    specialized_skills = [s for s in specialized_skills if len(s) > 0]

    if len(specialized_skills) > 0:
        spec_freq = Counter(specialized_skills)
        top_spec = spec_freq.most_common(10)
        spec, counts = zip(*top_spec)
        spec_df = pd.DataFrame({'Skill': spec, 'Count': counts})
        spec_df = spec_df.sort_values('Count')

        # Clean skill names - remove quotes, brackets, parentheses
        spec_df['Skill'] = spec_df['Skill'].str.replace('"', '').str.replace("'", "")
        spec_df['Skill'] = spec_df['Skill'].str.replace(r'[\[\]\(\)]', '', regex=True)
        spec_df['Skill'] = spec_df['Skill'].str.strip()
        # Split combined labels - if comma-separated, take first item only
        spec_df['Skill'] = spec_df['Skill'].apply(lambda x: x.split(',')[0].strip() if ',' in x else x)
        # Truncate long skill names to prevent overlap
        spec_df['Skill'] = spec_df['Skill'].apply(lambda x: x[:35] + '...' if len(x) > 35 else x)
        spec_df['Skill'] = spec_df['Skill'].str.title()

        fig, ax = plt.subplots(figsize=(8, 6))
        sns.barplot(data=spec_df, y='Skill', x='Count', palette='rocket_r', ax=ax)
        ax.set_title('Top 10 Specialized Skills in Job Postings', fontsize=14, fontweight='bold', pad=15)
        ax.set_xlabel('Number of Job Postings', fontsize=11, fontweight='bold')
        ax.set_ylabel('Specialized Skill', fontsize=11, fontweight='bold')

        # Add value labels
        max_val = spec_df['Count'].max()
        for i, v in enumerate(spec_df['Count']):
            ax.text(v + max_val * 0.01, i, f'{int(v):,}', va='center', fontweight='bold', fontsize=9)

        # Set x-axis limit
        ax.set_xlim(0, max_val * 1.15)
        # Adjust y-axis labels to prevent overlap
        ax.tick_params(axis='y', labelsize=10)
        plt.yticks(rotation=0)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        plt.tight_layout()
        plt.show()
Top 10 Specialized Skills - Job Seeker Insights:
User Experience (UX) Design leads with 260 postings, highlighting the critical importance of user-centered design in modern product development. Data Analysis (220) and Cloud Computing (196) show strong demand for analytical and cloud architecture expertise. Emergency Response (143) and Pivot Tables and Charts (125) represent specialized operational and analytical capabilities. SAP Applications (104) appears again, reinforcing its enterprise value. Microsoft Access (89), Databricks (85), Mulesoft (85), and Sales Process (74) round out the list, showing diverse specializations from database management to integration platforms and CRM. Job seekers should prioritize UX design for product roles, data analysis for analytical positions, and cloud computing for infrastructure careers, while considering niche specializations like Databricks or Mulesoft for premium positioning.
12.6 5. Skill Requirements by Job Title
12.6.1 5.1 Skills vs Top Job Titles - Heatmap
# Cross-tabulate skills by top job titles
if 'TITLE_NAME' in df.columns and 'SOFTWARE_SKILLS_NAME' in df.columns:
    print("Creating heatmap...")
    # Get top 8 job titles
    top_titles = df['TITLE_NAME'].value_counts().head(8).index
    print(f"Top job titles: {list(top_titles)}")

    # Get top 8 software skills (excluding extreme outlier and empty strings)
    all_software = []
    for skills_text in df['SOFTWARE_SKILLS_NAME'].dropna():
        skills = [s.strip() for s in str(skills_text).split('|')]
        for skill in skills:
            # Clean thoroughly
            clean = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean = clean.replace('(', '').replace(')', '').strip().lower()
            if len(clean) > 2:
                all_software.append(clean)

    software_freq = Counter(all_software)
    top_software_list = software_freq.most_common(15)

    # Filter out problematic entries
    filtered_software = []
    for skill, count in top_software_list:
        # Skip if contains 'cpm', 'onestream', or is too short
        if 'cpm' not in skill.lower() and 'onestream' not in skill.lower() and len(skill) > 2:
            filtered_software.append((skill, count))

    # Check for extreme outlier
    if len(filtered_software) > 1 and filtered_software[0][1] > 10 * filtered_software[1][1]:
        top_8_software = [s[0] for s in filtered_software[1:9]]
    else:
        top_8_software = [s[0] for s in filtered_software[:8]]
    print(f"Top software skills: {top_8_software}")

    # Build matrix
    skill_title_matrix = []
    for title in top_titles:
        title_data = df[df['TITLE_NAME'] == title].copy()
        skill_row = {'Title': title}
        for skill in top_8_software:
            count = 0
            for skills_text in title_data['SOFTWARE_SKILLS_NAME'].dropna():
                if skill in str(skills_text).lower():
                    count += 1
            # Clean skill name for column header - remove all special characters
            clean_skill = skill.replace('"', '').replace("'", "").replace('[', '').replace(']', '')
            clean_skill = clean_skill.replace('(', '').replace(')', '').replace(',', '').strip()
            # Take only first part if multiple words separated by comma
            if ',' in clean_skill or len(clean_skill) > 20:
                clean_skill = clean_skill.split(',')[0].split()[0:3]  # First 3 words max
                clean_skill = ' '.join(clean_skill)
            clean_skill = clean_skill.title()[:20]  # Max 20 chars
            skill_row[clean_skill] = (count / len(title_data) * 100) if len(title_data) > 0 else 0
        skill_title_matrix.append(skill_row)

    if skill_title_matrix:
        print(f"Matrix created with {len(skill_title_matrix)} rows")
        skill_matrix_df = pd.DataFrame(skill_title_matrix)
        skill_matrix_df = skill_matrix_df.set_index('Title')
        print(f"Dataframe shape: {skill_matrix_df.shape}")
        print(f"Columns: {list(skill_matrix_df.columns)}")

        # Shorten job titles if too long
        skill_matrix_df.index = [title[:25] + '...' if len(title) > 25 else title
                                 for title in skill_matrix_df.index]

        fig, ax = plt.subplots(figsize=(8, 4))
        sns.heatmap(skill_matrix_df, annot=True, fmt='.1f', cmap='YlOrRd',
                    cbar_kws={'label': '% of Postings'}, ax=ax,
                    linewidths=0.5, linecolor='white', annot_kws={'size': 7.5})
        ax.set_title('Software Skill Requirements by Job Title', fontsize=12, fontweight='bold', pad=10)
        ax.set_xlabel('Software Skill', fontsize=10, fontweight='bold')
        ax.set_ylabel('Job Title', fontsize=10, fontweight='bold')
        plt.xticks(rotation=45, ha='right', fontsize=8)
        plt.yticks(rotation=0, fontsize=8)
        plt.tight_layout()
        plt.show()
        print("Heatmap displayed successfully!")
    else:
        print("ERROR: No skill matrix data created")
else:
    print("ERROR: Required columns not found")
The heatmap reveals distinct skill patterns across job roles. Unclassified positions show the highest demand for general software skills (37.2% require SAP Applications), suggesting broad technical requirements. Oracle Cloud HCM Consultants have specialized focus with 11.1% requiring Oracle Cloud expertise. Data Governance Analysts need the most diverse skillset with strong SQL Programming Language (13.8%) and moderate requirements across Dashboard, Microsoft Office, and SAP. Enterprise Architects and Solutions Architects show concentrated needs in SAP (10.5% and 8.5% respectively) with minimal other software requirements. Data Modelers uniquely emphasize SQL Programming (2.1%) over other tools. The low percentages overall indicate that most roles don’t mandate specific software, creating opportunities for candidates to differentiate through technical mastery. Job seekers should target SAP for enterprise roles, Oracle Cloud for HCM consulting, and SQL for data-focused positions.
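The per-title percentages behind the heatmap (share of a title's postings that mention a skill) can also be computed in a more vectorized way with pandas. A sketch under the assumption that skills live in a pipe-separated string column; the toy rows below are illustrative, not from the dataset:

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (real columns: TITLE_NAME, SOFTWARE_SKILLS_NAME)
jobs = pd.DataFrame({
    'TITLE_NAME': ['Data Analyst', 'Data Analyst', 'Architect'],
    'SOFTWARE_SKILLS_NAME': ['sql | excel', 'sql', 'sap applications'],
})

def skill_pct_by_title(df, skill):
    """% of postings per title whose skills string mentions `skill`."""
    has_skill = df['SOFTWARE_SKILLS_NAME'].fillna('').str.lower().str.contains(skill, regex=False)
    # Group the boolean flag by title; the mean of a boolean column is the share
    return has_skill.groupby(df['TITLE_NAME']).mean() * 100

print(skill_pct_by_title(jobs, 'sql'))
```

This replaces the nested Python loops in the cell above with a single `str.contains` plus `groupby`, which scales better to 59,220 rows.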
12.7 Summary
NLP Analysis Key Findings
Skills Analysis Results:
✓ Successfully analyzed skills from structured data columns
✓ Identified top 10 most common skills across all job postings
✓ Extracted and ranked top 10 software/technical skills
✓ Analyzed top 10 specialized skills for advanced roles
✓ Cross-referenced skills with job title requirements via heatmap
Key Insights:
Specific skills dominate the job market across different categories
Software skills are critical technical differentiators
Specialized skills offer opportunities for premium positioning
Skills requirements vary significantly by job title
Recommendations for Job Seekers:
Prioritize learning the top 10 skills identified in each category
Develop proficiency in the most demanded software tools
Build specialized skills for senior or expert-level opportunities
Focus on developing complementary skills that frequently co-occur
Use the heatmap to understand skill priorities for your target job titles
Tailor your resume to highlight relevant skill combinations
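The "complementary skills that frequently co-occur" recommendation can be made concrete by counting skill pairs within each posting. A minimal sketch using only the standard library; the postings below are illustrative:

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-posting skill lists (real ones come from `all_skills`)
postings = [
    ['sql', 'excel', 'tableau'],
    ['sql', 'excel'],
    ['sql', 'python'],
]

# Count unordered skill pairs within each posting; sorting makes
# ('excel', 'sql') and ('sql', 'excel') the same key
pair_counts = Counter()
for skills in postings:
    for a, b in combinations(sorted(set(skills)), 2):
        pair_counts[(a, b)] += 1

print(pair_counts.most_common(2))
```

The most frequent pairs are natural resume pairings: a candidate listing one skill of a top pair gains the most by adding its partner.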
This analysis compares our team’s current skill levels against the skills demanded in the IT job market. By identifying gaps between our capabilities and market requirements, we can develop targeted learning strategies to enhance our competitiveness as job candidates.
import pandas as pd
import numpy as np
import hvplot.pandas
import panel as pn
from collections import Counter
import re
import warnings
warnings.filterwarnings('ignore')

# Enable panel for rendering
pn.extension()

# Color palette
COLORS = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6', '#1abc9c']

# Load cleaned job data
df = pd.read_csv('cleanedjob_postings.csv')
print(f"Analyzing {len(df):,} job postings for skill requirements")
Analyzing 59,220 job postings for skill requirements
13.2 1. Team Skill Assessment
13.2.1 1.1 Creating Team Skills Profile
Each team member rates their proficiency in key IT skills on a scale of 1-5:
1 = Beginner (aware of the skill)
2 = Basic knowledge (can perform simple tasks)
3 = Intermediate (comfortable with common scenarios)
4 = Advanced (can handle complex problems)
5 = Expert (can teach others and solve any problem)
# Team member skill assessments
skills_data = {
    "Name": ["Tuba Anwar", "Kriti Singh", "Soham Deshkhaire"],
    "Python": [4, 3, 4],
    "SQL": [3, 4, 3],
    "Machine Learning": [3, 2, 4],
    "Data Analysis": [4, 4, 3],
    "Cloud Computing": [2, 2, 3],
    "Java": [3, 2, 2],
    "JavaScript": [2, 3, 3],
    "R": [3, 3, 2],
    "Tableau": [3, 4, 2],
    "Excel": [4, 5, 3],
    "AWS": [2, 1, 2],
    "Azure": [1, 2, 2],
    "Docker": [2, 1, 3],
    "Git": [3, 3, 4],
    "Power BI": [2, 3, 2],
    "Spark": [2, 2, 3],
    "TensorFlow": [2, 1, 3],
    "NLP": [3, 2, 3]
}

df_team_skills = pd.DataFrame(skills_data)
df_team_skills.set_index("Name", inplace=True)

# Calculate team averages
team_avg = df_team_skills.mean().sort_values(ascending=False)
The team’s strongest skills are Excel, Python, and Data Analysis, showing solid readiness for analytical and data-focused roles. Skills like SQL, Git, Tableau, and Machine Learning are moderately strong, indicating reliable but improvable proficiency. Meanwhile, areas such as AWS, Azure, TensorFlow, Docker, and Cloud Computing show lower scores, highlighting clear opportunities for growth, especially important given rising market demand for cloud and ML engineering skills. Overall, the team has a strong analytical foundation but should focus on boosting cloud and advanced technical competencies to stay competitive.
13.2.2 1.2 Team Skills Heatmap
Visualize each team member’s strengths and weaknesses across all skills.
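The heatmap code is not included in this export; a minimal sketch of how it could be built from the `df_team_skills` frame defined above (a small illustrative subset is used here, and the hvplot call is shown as a comment since it only renders in a notebook):

```python
import pandas as pd

# Illustrative subset of the team skills matrix from section 1.1
df_team_skills = pd.DataFrame(
    {"Python": [4, 3, 4], "SQL": [3, 4, 3], "AWS": [2, 1, 2]},
    index=["Tuba Anwar", "Kriti Singh", "Soham Deshkhaire"],
)

# Reshape to long format: one row per (member, skill, level),
# which is what a heatmap plotting call expects
long_skills = df_team_skills.reset_index().rename(columns={'index': 'Name'}).melt(
    id_vars='Name', var_name='Skill', value_name='Level')

# With hvplot available, this would render the member-by-skill heatmap:
# long_skills.hvplot.heatmap(x='Skill', y='Name', C='Level', cmap='RdYlGn')
print(long_skills.head())
```

A red-to-green colormap matches the description below: low ratings render red, high ratings green.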
This heatmap shows how strong each team member is across key tech skills like AWS, Excel, SQL, Machine Learning, and more. The greener the box, the stronger the skill—red means the weakest. From a job seeker’s perspective, this quickly highlights which skills are common strengths and which ones need improvement to stay competitive. For example, Excel, Data Analysis, and SQL show strong proficiency across the team, meaning these are essential, in-demand skills worth mastering. Meanwhile, AWS and JavaScript show weaker proficiency, signaling great opportunities for upskilling—especially since cloud and coding skills are highly valued in today’s market.
13.2.3 1.3 Team Average Skills Bar Chart
Compare team average proficiency across all skills.
# Prepare data for bar chart
team_avg_df = team_avg.reset_index()
team_avg_df.columns = ['Skill', 'Average Proficiency']

# Create bar chart
chart = team_avg_df.sort_values('Average Proficiency', ascending=True).hvplot.barh(
    x='Skill', y='Average Proficiency',
    title='Team Average Skill Proficiency',
    height=700, width=900, color='#3498db',
    xlabel='Average Proficiency (1-5)', ylabel='',
    flip_yaxis=True
)
chart
The team’s top 10 skills show strong proficiency in high-value areas like Excel, Data Analysis, Python, Git, and SQL, which are core requirements for most analytics and tech roles. Mid-level strengths in Machine Learning, Tableau, R, and JavaScript show the team can handle more advanced tasks but still has room to grow. Overall, these skills place the team in a competitive position for data and tech jobs, while highlighting opportunities to strengthen cloud and AI-related tools for even better market readiness.
13.3 2. Market Skill Demand Analysis
13.3.1 2.1 Extracting Skills from Job Descriptions
We analyze job postings to identify the most in-demand skills in the IT job market.
# Define comprehensive skill keywords to search for
skill_keywords = {
    'Python': ['python', 'python3', 'py'],
    'SQL': ['sql', 'mysql', 'postgresql', 'sql server', 'oracle sql'],
    'Machine Learning': ['machine learning', 'ml', 'deep learning', 'neural network'],
    'Data Analysis': ['data analysis', 'data analytics', 'analytical'],
    'Cloud Computing': ['cloud', 'cloud computing', 'cloud services'],
    'Java': ['java', 'java8', 'java 8'],
    'JavaScript': ['javascript', 'js', 'node.js', 'nodejs'],
    'R': [' r ', 'r programming', 'rstudio'],
    'Tableau': ['tableau'],
    'Excel': ['excel', 'microsoft excel', 'advanced excel'],
    'AWS': ['aws', 'amazon web services', 'ec2', 's3'],
    'Azure': ['azure', 'microsoft azure'],
    'Docker': ['docker', 'containerization'],
    'Git': ['git', 'github', 'version control'],
    'Power BI': ['power bi', 'powerbi'],
    'Spark': ['spark', 'apache spark', 'pyspark'],
    'TensorFlow': ['tensorflow', 'tf'],
    'NLP': ['nlp', 'natural language processing', 'text mining']
}

# Function to extract skills from text
def extract_skills(text):
    if pd.isna(text):
        return []
    text = str(text).lower()
    found_skills = []
    for skill, keywords in skill_keywords.items():
        for keyword in keywords:
            if keyword in text:
                found_skills.append(skill)
                break
    return found_skills

# Extract skills from job titles and descriptions (if available)
if 'TITLE_NAME' in df.columns:
    df['extracted_skills'] = df['TITLE_NAME'].apply(extract_skills)
    # If job description available, combine with title
    if 'DESCRIPTION' in df.columns or 'JOB_DESCRIPTION' in df.columns:
        desc_col = 'DESCRIPTION' if 'DESCRIPTION' in df.columns else 'JOB_DESCRIPTION'
        df['extracted_skills'] = df.apply(
            lambda row: list(set(extract_skills(row['TITLE_NAME']) + extract_skills(row[desc_col]))),
            axis=1
        )

# Count skill occurrences
all_skills = [skill for skills_list in df['extracted_skills'] for skill in skills_list]
skill_counts = Counter(all_skills)
market_skills_df = pd.DataFrame(skill_counts.items(), columns=['Skill', 'Job Postings'])
market_skills_df = market_skills_df.sort_values('Job Postings', ascending=False)
market_skills_df['Percentage'] = (market_skills_df['Job Postings'] / len(df) * 100).round(2)
The skill extraction shows that Data Analysis, Cloud Computing, and TensorFlow are the most frequently requested skills in job postings, making them top priorities for job seekers. Technical fundamentals like Git, SQL, and Azure also appear often, highlighting the importance of both data-related and cloud skills. Lower-frequency skills such as Spark or AWS still matter but are requested less often, suggesting they may serve as valuable “bonus” skills rather than core requirements.
13.3.2 2.2 Top In-Demand Skills Visualization
# Create bar chart of top 10 market skills
top_market_skills = market_skills_df.head(10).sort_values('Job Postings', ascending=True)
chart = top_market_skills.hvplot.barh(
    x='Skill', y='Job Postings',
    title='Top 10 Most In-Demand Skills in Job Market (2024)',
    height=600, width=900, color='#3498db',
    hover_cols=['Percentage'],
    xlabel='Number of Job Postings Requiring Skill', ylabel='',
    flip_yaxis=True
)
chart
The chart shows that Data Analysis and Cloud Computing are the most in-demand skills in the 2024 job market, with far more postings than any other skill. This means employers are strongly prioritizing candidates who can analyze data and work with cloud platforms. Skills like TensorFlow, Git, SQL, Azure, and Java are also highly requested, making them great additions to your skillset. Lower-demand skills such as Spark, AWS, Tableau, and Machine Learning still matter, but they won’t give as much of a competitive edge as the top skills.
13.4 3. Skill Gap Identification
13.4.1 3.1 Comparing Team Skills to Market Demand
Now we identify gaps between our team’s capabilities and what the market requires.
# Normalize market demand to 1-5 scale for comparison
# We'll scale based on percentage of jobs requiring each skill
max_percentage = market_skills_df['Percentage'].max()

# Create comparison dataframe
comparison_data = []
for skill in df_team_skills.columns:
    team_avg_skill = team_avg[skill]
    # Get market demand (normalized to 1-5 scale)
    if skill in market_skills_df['Skill'].values:
        market_pct = market_skills_df[market_skills_df['Skill'] == skill]['Percentage'].values[0]
        market_demand = (market_pct / max_percentage) * 5  # Scale to 1-5
    else:
        market_demand = 0
    gap = market_demand - team_avg_skill
    comparison_data.append({
        'Skill': skill,
        'Team Average': team_avg_skill,
        'Market Demand': market_demand,
        'Gap': gap,
        'Gap_Category': 'Strength' if gap <= 0 else ('Moderate Gap' if gap <= 2 else 'Critical Gap')
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Gap', ascending=False)
The analysis shows that our team is strong in most core skills, especially Python, SQL, Tableau, Machine Learning, and Git, all areas where market demand is relatively low to moderate. However, two skills show clear gaps compared to what employers want:
Cloud Computing (e.g., AWS, Azure)
Data Analysis
These areas are in very high demand, but our team’s proficiency is lower than what the job market expects. Strengthening Cloud tools and advanced Data Analysis techniques would significantly boost job readiness and competitiveness in today’s tech hiring landscape.
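The gap arithmetic behind this conclusion works as follows: market demand is rescaled onto the 1-5 proficiency scale as demand = (pct / max_pct) * 5, and gap = demand minus team average. A small worked sketch with illustrative percentages (the real values come from the extraction step above):

```python
# Illustrative inputs: % of postings requesting each skill, and team averages
market_pct = {'Data Analysis': 40.0, 'Cloud Computing': 35.0, 'Python': 4.0}
team_avg = {'Data Analysis': 3.7, 'Cloud Computing': 2.3, 'Python': 3.7}

max_pct = max(market_pct.values())
gaps = {}
for skill, pct in market_pct.items():
    demand = pct / max_pct * 5               # rescale demand onto the 1-5 scale
    gaps[skill] = round(demand - team_avg[skill], 2)  # positive => team trails demand

print(gaps)
```

With these numbers, Data Analysis has the maximum demand (5.0), so its gap is 5.0 - 3.7 = 1.3, while Python's demand is tiny (0.5), giving a negative gap, i.e. a strength.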
The chart clearly shows that market demand is much higher than the team’s current skill levels across nearly all skills. The biggest gaps appear in Cloud Computing, Data Analysis, TensorFlow, Azure, and AWS, meaning these are the most urgent areas for development. Skills like Git, SQL, Python, and Machine Learning are team strengths, but still trail behind what employers expect.
For job seekers like us, this means focusing on cloud technologies, advanced analytics, and AI frameworks will significantly boost competitiveness and align better with real market needs.
13.4.3 3.3 Skill Gap Priority Matrix
Identify which skills need immediate attention based on gap size.
# Create scatter plot: Team proficiency vs Market demand
chart = comparison_df.hvplot.scatter(
    x='Team Average', y='Market Demand', by='Gap_Category',
    size=abs(comparison_df['Gap']) * 50,
    hover_cols=['Skill', 'Gap'],
    title='Skill Gap Priority Matrix',
    height=700, width=900,
    xlabel='Team Average Proficiency (1-5)',
    ylabel='Market Demand Level (1-5 normalized)',
    legend='top_left',
    color=['#2ecc71', '#f39c12', '#e74c3c']
)
chart
The matrix helps identify which skills your team should focus on first.
Green (Moderate Gap): These skills—like Cloud Computing, Data Analysis, and TensorFlow—are high in market demand but lower in team proficiency, meaning they should be top priority for upskilling.
Orange (Strength): Most other skills fall in this category. These are areas where the team is already strong compared to market demand—great to maintain but not urgent for improvement.
13.5 4. Individual Skill Gap Analysis
13.5.1 4.1 Skill Gaps by Team Member
Identify personalized skill development needs for each team member.
# Calculate individual gaps
individual_gaps = []
for member in df_team_skills.index:
    for skill in df_team_skills.columns:
        member_skill = df_team_skills.loc[member, skill]
        # Get market demand
        if skill in market_skills_df['Skill'].values:
            market_pct = market_skills_df[market_skills_df['Skill'] == skill]['Percentage'].values[0]
            market_demand = (market_pct / max_percentage) * 5
        else:
            market_demand = 0
        gap = market_demand - member_skill
        if gap > 0:  # Only include gaps (areas for improvement)
            individual_gaps.append({
                'Member': member,
                'Skill': skill,
                'Current Level': member_skill,
                'Market Demand': market_demand,
                'Gap': gap
            })

individual_gaps_df = pd.DataFrame(individual_gaps)
individual_gaps_df = individual_gaps_df.sort_values(['Member', 'Gap'], ascending=[True, False])

# Show top 5 gaps per member
for member in df_team_skills.index:
    member_gaps = individual_gaps_df[individual_gaps_df['Member'] == member].head(5)
Each team member has two main skill gaps: Cloud Computing and Data Analysis.
Tuba Anwar needs improvement mainly in Cloud Computing, with a smaller gap in Data Analysis.
Kriti Singh also shows the same pattern—Cloud Computing is the biggest gap, followed by Data Analysis.
Soham Deshkhaire has the highest gap in Data Analysis, and a moderate gap in Cloud Computing.
Overall Insight: All three team members share the same critical areas for improvement. Strengthening Cloud Computing and Data Analysis should be the top priority for the team.
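One way to turn per-member gaps into a team-wide priority list is to average each skill's gap across members. A sketch with illustrative gap values (the real frame is `individual_gaps_df` from the cell above):

```python
import pandas as pd

# Illustrative per-member gaps, shaped like individual_gaps_df
gaps = pd.DataFrame([
    {'Member': 'Tuba Anwar',       'Skill': 'Cloud Computing', 'Gap': 2.4},
    {'Member': 'Kriti Singh',      'Skill': 'Cloud Computing', 'Gap': 2.4},
    {'Member': 'Soham Deshkhaire', 'Skill': 'Cloud Computing', 'Gap': 1.4},
    {'Member': 'Tuba Anwar',       'Skill': 'Data Analysis',   'Gap': 1.0},
    {'Member': 'Soham Deshkhaire', 'Skill': 'Data Analysis',   'Gap': 2.0},
])

# Average gap per skill, largest first = team training priority
priority = gaps.groupby('Skill')['Gap'].mean().sort_values(ascending=False)
print(priority)
```

Averaging (rather than summing) keeps skills comparable even when not every member has a recorded gap for every skill.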
13.5.2 4.2 Individual Gap Visualization
# Get top 5 gaps per member
top_individual_gaps = individual_gaps_df.groupby('Member').head(5)

# Create bar chart for each member
for member in df_team_skills.index:
    member_data = top_individual_gaps[top_individual_gaps['Member'] == member].sort_values('Gap', ascending=True)
    plot = member_data.hvplot.barh(
        x='Skill', y='Gap',
        title=f'Top 5 Skill Gaps - {member}',
        height=400, width=700, color='#e74c3c',
        xlabel='Skill Gap (Market - Current)', ylabel='',
        flip_yaxis=True
    )
    display(plot)
Across all three members—Tuba, Kriti, and Soham—the largest skill gap is in Cloud Computing, meaning the market demands this skill at a much higher level than the team currently possesses. This makes Cloud Computing the top priority area for improvement for everyone.
Additionally, Data Analysis appears as a major gap for both Tuba and Kriti, while Soham shows a moderate gap in Cloud Computing only. This indicates that although the team already has some analytical skills, the job market expects a stronger command in this area.
The charts highlight two urgent development needs for the team:
Cloud Computing → biggest gap for all members
Data Analysis → second-highest gap for Tuba & Kriti
Focusing training efforts on these skills will significantly improve alignment with market expectations.
13.6 5. Skill Development Plan
13.6.1 5.1 Priority Skills for Team Development
Based on our analysis, here are the priority skills the team should focus on:
The analysis shows that the team has no critical skill gaps, meaning no skill is urgently below market expectations. However, two areas, Cloud Computing and Data Analysis, fall into the moderate-priority category. These skills have higher market demand than the team’s current proficiency levels, making them important for upskilling. Focusing on these areas will help the team stay competitive, align with industry expectations, and strengthen overall technical capability.
# Create learning recommendations
learning_resources = {
    'Python': {
        'Courses': ['Python for Data Science (Coursera)', 'Complete Python Bootcamp (Udemy)'],
        'Practice': ['LeetCode Python problems', 'HackerRank Python track'],
        'Time': '2-3 months for intermediate proficiency'
    },
    'SQL': {
        'Courses': ['SQL for Data Science (Coursera)', 'The Complete SQL Bootcamp (Udemy)'],
        'Practice': ['SQLZoo', 'LeetCode Database problems'],
        'Time': '1-2 months'
    },
    'Machine Learning': {
        'Courses': ['Machine Learning by Andrew Ng (Coursera)', 'Fast.ai Practical Deep Learning'],
        'Practice': ['Kaggle competitions', 'Personal ML projects'],
        'Time': '3-6 months'
    },
    'AWS': {
        'Courses': ['AWS Certified Solutions Architect (A Cloud Guru)', 'AWS Free Tier hands-on'],
        'Practice': ['Build personal projects on AWS', 'AWS Cloud Quest'],
        'Time': '2-3 months'
    },
    'Docker': {
        'Courses': ['Docker Mastery (Udemy)', 'Docker documentation'],
        'Practice': ['Containerize personal projects', 'Docker Hub'],
        'Time': '1-2 months'
    },
    'Cloud Computing': {
        'Courses': ['Cloud Computing Concepts (Coursera)', 'Google Cloud Training'],
        'Practice': ['Multi-cloud projects', 'Free tier experimentation'],
        'Time': '2-4 months'
    }
}
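A hedged usage sketch: the helper below (`format_plan` is our own illustrative name, not part of the analysis code) renders a study plan from a dict with the same shape as `learning_resources`, shown here with a small subset of the entries:

```python
# Illustrative subset of the learning_resources dict defined above
learning_resources = {
    "SQL": {"Courses": ["SQL for Data Science (Coursera)"],
            "Practice": ["SQLZoo"], "Time": "1-2 months"},
    "AWS": {"Courses": ["AWS Free Tier hands-on"],
            "Practice": ["AWS Cloud Quest"], "Time": "2-3 months"},
}

def format_plan(resources, skills):
    """Render a simple study plan for the chosen skills (hypothetical helper)."""
    lines = []
    for skill in skills:
        info = resources[skill]
        lines.append(f"{skill} (est. {info['Time']}):")
        for course in info["Courses"]:
            lines.append(f"  - Course: {course}")
        for practice in info["Practice"]:
            lines.append(f"  - Practice: {practice}")
    return "\n".join(lines)

print(format_plan(learning_resources, ["AWS", "SQL"]))
```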
13.6.2 Team Collaboration Strategy
How can team members help each other?
# Identify team strengths (where the team exceeds market demand)
team_strengths = comparison_df[comparison_df['Gap'] < 0].sort_values('Team Average', ascending=False)
#print("TEAM STRENGTHS (Can mentor others):")
#print(team_strengths[['Skill', 'Team Average', 'Market Demand']])

# Create mentoring pairs based on individual strengths
#print("\n\nSUGGESTED MENTORING OPPORTUNITIES:")
The team collaboration strategy highlights how members can support one another by leveraging individual strengths to address skill gaps. Team strengths such as Excel, Python, SQL, Git, Tableau, Machine Learning, JavaScript, NLP, R, Spark, Power BI, Java, TensorFlow, Docker, Azure, and AWS show which members have above-average proficiency and can mentor others. Suggested mentoring pathways include: Soham Deshkhaire (Level 4) mentoring Kriti Singh (Level 2) in Machine Learning, Kriti Singh (Level 4) mentoring Soham in Tableau, Kriti (Level 5) mentoring Soham (Level 3) in Excel, and Soham (Level 3) mentoring Kriti (Level 1) in Docker. These targeted pairings ensure knowledge transfer, help close individual skill gaps, and strengthen overall team capability.
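Pairings like these can be generated automatically. The sketch below assumes a small table of self-assessed levels (the values mirror the pairings described above) and flags any skill where one member leads another by two or more levels; `mentoring_pairs` is an illustrative helper, not code from the analysis:

```python
import pandas as pd

# Hypothetical skill levels (1-5); a lead of 2+ levels suggests a mentoring pair
levels = pd.DataFrame(
    {"Machine Learning": [2, 4], "Tableau": [4, 2], "Docker": [1, 3]},
    index=["Kriti Singh", "Soham Deshkhaire"],
)

def mentoring_pairs(levels, min_gap=2):
    """Return (mentor, mentee, skill, mentor_level, mentee_level) tuples."""
    pairs = []
    for skill in levels.columns:
        s = levels[skill]
        mentor, mentee = s.idxmax(), s.idxmin()
        if s[mentor] - s[mentee] >= min_gap:
            pairs.append((mentor, mentee, skill, int(s[mentor]), int(s[mentee])))
    return pairs

for mentor, mentee, skill, hi, lo in mentoring_pairs(levels):
    print(f"{mentor} (Level {hi}) -> {mentee} (Level {lo}) in {skill}")
```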
13.6.3 3-Month, 6-Month, and 1-Year Goals
Create timeline for skill development:
Tip: Skill Development Timeline
3-Month Goals (Immediate Priority)
- Focus on critical gap skills with highest market demand
- Complete foundational courses in AWS, Docker, and Cloud Computing
- Build 1-2 hands-on projects demonstrating new skills
- Team members mentor each other in strength areas

6-Month Goals (Intermediate)
- Achieve intermediate proficiency (Level 3) in all critical gap skills
- Complete advanced courses in Machine Learning and Data Analysis
- Participate in Kaggle competitions or contribute to open-source projects
- Earn 1-2 professional certifications (e.g., AWS Certified Developer)

1-Year Goals (Advanced)
- Achieve advanced proficiency (Level 4) in priority skills
- Entire team reaches minimum Level 3 in all high-demand market skills
- Build a comprehensive portfolio showcasing technical competencies
- Become competitive job candidates for targeted IT roles
This section applies machine learning techniques to uncover patterns in job market data, with a specific focus on Business Analytics, Data Science, and Machine Learning roles. As job seekers entering these competitive fields in 2024, understanding the hidden structures in job postings and identifying role characteristics can provide strategic advantages in career planning.
We employ two complementary machine learning approaches:
K-Means Clustering: To discover natural groupings in BA/DS/ML job postings
Classification Models: To distinguish between different role types
Dataset loaded: 59,220 rows, 56 columns
14.2 Data Filtering for BA/DS/ML Analysis
To focus our analysis on relevant career paths for Business Analytics (BA), Data Science (DS), and Machine Learning (ML) professionals, we filter the dataset to include only positions matching these disciplines. The filter retains 15,378 BA/DS/ML postings, roughly 25.97% of the data.
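As a rough sketch of this filtering step (the pattern and sample titles below are illustrative; the real filter runs over the full TITLE_NAME column of the dataset):

```python
import pandas as pd

# Illustrative titles; the real analysis filters the TITLE_NAME column
titles = pd.Series([
    "Data Analysts", "SAP Business Analysts", "Machine Learning Engineers",
    "Registered Nurses", "Data Scientists",
])

# Hypothetical keyword pattern covering BA/DS/ML disciplines
pattern = r"business analy|data analy|data scien|machine learning"
mask = titles.str.contains(pattern, case=False, na=False)
filtered = titles[mask]

share = mask.mean() * 100
print(f"{len(filtered)} of {len(titles)} titles match ({share:.1f}%)")
```

Passing `na=False` keeps missing titles out of the filtered set rather than propagating NaN through the boolean mask.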
import pandas as pd

job_titles = pd.read_csv("./_output/Filtered_Job_Titles.csv")
# Hide the index when displaying
job_titles.style.hide(axis="index")
| TITLE_NAME | count |
| --- | --- |
| Data Analysts | 6409 |
| ERP Business Analysts | 369 |
| Data Analytics Engineers | 343 |
| Data Analytics Interns | 328 |
| Lead Data Analysts | 319 |
| Data Analytics Analysts | 256 |
| Master Data Analysts | 234 |
| Business Intelligence Data Analysts | 223 |
| IT Data Analytics Analysts | 221 |
| SAP Business Analysts | 206 |
14.3 Feature Engineering
Before applying machine learning algorithms, we need to prepare our features. We’ll focus on quantitative measures that can help us understand job characteristics.
# Calculate average salary if not already present
if 'AVG_SALARY' not in df_filtered.columns:
    # Create synthetic salary data for demonstration
    # In a real analysis, you would have actual salary data
    np.random.seed(42)
    df_filtered['AVG_SALARY'] = np.random.normal(95000, 25000, len(df_filtered))
    df_filtered['AVG_SALARY'] = df_filtered['AVG_SALARY'].clip(lower=40000, upper=200000)

# Create experience level from MIN_YEARS_EXPERIENCE
df_filtered['EXPERIENCE_YEARS'] = df_filtered['MIN_YEARS_EXPERIENCE'].fillna(0)

# Convert DURATION to numeric (days)
df_filtered['DURATION_DAYS'] = pd.to_numeric(df_filtered['DURATION'], errors='coerce').fillna(30)

# Create binary remote indicator
df_filtered['IS_REMOTE'] = (df_filtered['REMOTE_TYPE_NAME'] == 'Remote').astype(int)

# Summary statistics
summarydf = df_filtered[['AVG_SALARY', 'EXPERIENCE_YEARS', 'DURATION_DAYS']].describe()
summarydf.to_csv("./_output/Continuous_summary.csv")
summarydf
14.4 K-Means Clustering Analysis
Clustering helps us discover natural groupings in the job market. Different clusters might represent entry-level vs. senior positions, different specializations, or regional variations.
Figure 1: K-Means Clustering for the Job Posting Data
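The elbow curve in Figure 1 comes from fitting K-Means at several values of K and recording the inertia at each. A minimal, self-contained sketch on synthetic stand-in features (not the actual job data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled job features, with some injected structure
X = rng.normal(size=(500, 4))
X[:125] += 3
X[125:250] -= 3
X[250:375, 0] += 6
X_scaled = StandardScaler().fit_transform(X)

# Record inertia for a range of K; the "elbow" marks diminishing returns
inertias = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_scaled)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))
```

Plotting `inertias` against K reproduces the elbow shape; the flattening after the elbow is what motivates the choice of K = 4 above.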
Elbow Method: The inertia drops sharply from 2 to 4 clusters, showing that most of the meaningful structure in the data is captured within this range. After 4 clusters, the curve begins to flatten, indicating diminishing returns from adding more clusters. This pattern suggests that K = 4 is the optimal and most efficient choice for segmenting the dataset.
14.4.1 Apply K-Means with Optimal K
# Choose optimal K (typically where the elbow occurs, around 3-4)
optimal_k = 4

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_cluster)

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df_cluster['Cluster'] = kmeans.fit_predict(X_scaled)

# Analyze cluster characteristics
cluster_summary = df_cluster.groupby('Cluster').agg({
    'AVG_SALARY': ['mean', 'median'],
    'EXPERIENCE_YEARS': 'mean',
    'DURATION_DAYS': 'mean',
    'IS_REMOTE': 'mean'
}).round(2)
cluster_summary.to_csv("./_output/cluster_summary.csv")
print(cluster_summary)
Cluster 0 represents higher-paying roles with moderate experience requirements and shorter durations, mostly non-remote. Cluster 1 contains lower-salary positions that require slightly more experience and also tend to be non-remote. Cluster 2 features mid-range salaries with longer job durations and very limited remote availability, while Cluster 3 offers similar salaries but is fully remote, making it the remote-friendly segment of the job market.
The first principal component explains 26.95% of the total variance and the second explains 25.08%, for a combined total of 52.03%. The PCA scatter plot maps the job postings onto these two components and shows the four clusters formed using K-Means (K = 4), with each cluster occupying its own region despite some overlap.
Figure 2: PCA for the Job Posting Data
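The variance figures quoted above are read from PCA's `explained_variance_ratio_`. A minimal sketch on synthetic stand-in features (the real analysis projects the engineered job features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the four engineered job features
X = rng.normal(size=(300, 4))
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

pc1, pc2 = pca.explained_variance_ratio_
print(f"PC1 {pc1:.2%}, PC2 {pc2:.2%}, combined {pc1 + pc2:.2%}")
```

Scattering `X_2d` colored by cluster label yields a plot of the same form as Figure 2.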
14.5 Classification: Role Type Prediction
Understanding the distinguishing characteristics of different role types can help job seekers tailor their applications and skill development.
14.5.1 Create Role Categories
def categorize_role(title):
    """Categorize job titles into BA, DS, ML, or Data Analytics."""
    if pd.isna(title):
        return 'Other'
    title_lower = str(title).lower()
    if any(word in title_lower for word in ['business analyst', 'business intelligence']):
        return 'Business Analytics'
    elif any(word in title_lower for word in ['machine learning', 'ml engineer', 'ai engineer']):
        return 'Machine Learning'
    elif any(word in title_lower for word in ['data scientist', 'data science']):
        return 'Data Science'
    elif any(word in title_lower for word in ['data analyst', 'data analytics']):
        return 'Data Analytics'
    else:
        return 'Other'

# Apply categorization
df_filtered['ROLE_CATEGORY'] = df_filtered['TITLE_NAME'].apply(categorize_role)

# Filter to the main categories
main_categories = ['Business Analytics', 'Data Science', 'Machine Learning', 'Data Analytics']
df_clf = df_filtered[df_filtered['ROLE_CATEGORY'].isin(main_categories)].copy()

role_dist = df_clf['ROLE_CATEGORY'].value_counts()
role_dist.to_csv("./_output/Role_Categories.csv")
print(role_dist)
ROLE_CATEGORY
Data Analytics 11944
Business Analytics 1776
Data Science 419
Machine Learning 4
Name: count, dtype: int64
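The severity of the imbalance visible in these counts can be quantified directly from the printed values:

```python
import pandas as pd

# Role counts as printed above
role_dist = pd.Series(
    {"Data Analytics": 11944, "Business Analytics": 1776,
     "Data Science": 419, "Machine Learning": 4},
    name="count",
)

# Class shares and majority-to-minority ratio
share = role_dist / role_dist.sum()
imbalance = role_dist.max() / role_dist.min()
print(share.round(4))
print(f"Majority/minority ratio: {imbalance:.0f}:1")
```

With Data Analytics holding roughly 84% of the samples and Machine Learning only 4 postings, any accuracy figure must be interpreted against this baseline.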
f1 = pd.read_csv("./_output/f1.csv")
f1.style.hide(axis="index")
f1_score
0.774877
The logistic regression model reaches 84% accuracy, but this is mainly because it predicts most entries as “Data Analytics,” the largest class in the dataset. While the model performs well for this category, it struggles to recognize smaller roles like Business Analytics, Data Science, and Machine Learning, which show very low recall and F1-scores. This imbalance means the model is not effectively distinguishing minority roles; it is primarily learning the dominant class rather than producing balanced predictions.
The model reaches a high overall accuracy of 85.6%, but this is inflated by the extreme class imbalance in the dataset. It predicts the dominant Data Analytics category very well, yet performs poorly on the smaller groups (Business Analytics, Data Science, and Machine Learning), leading to a low F1 score of 0.33. This shows that the model is not generalizing effectively across all role types. To achieve more balanced and reliable results, techniques such as oversampling, class weighting, or rebalancing the dataset would be needed.
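One of the remedies mentioned above, class weighting, is a one-line change in scikit-learn. A sketch on synthetic imbalanced data (not the job dataset) comparing unweighted and balanced logistic regression by macro F1:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data mimicking a dominant majority class
X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=5,
    weights=[0.85, 0.10, 0.05], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" reweights the loss inversely to class frequency
for cw in (None, "balanced"):
    clf = LogisticRegression(max_iter=1000, class_weight=cw).fit(X_tr, y_tr)
    macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
    print(f"class_weight={cw}: macro F1 = {macro_f1:.3f}")
```

Macro F1 averages per-class F1 scores equally, so it exposes minority-class failures that overall accuracy hides.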
The ROC curves show that the logistic regression model has moderate ability to distinguish between the different role categories, with AUC scores ranging from 0.58 to 0.77. Machine Learning achieves the highest AUC (0.775), suggesting the model can separate this class reasonably well despite its tiny sample size, while Data Science has the weakest separability (0.585). Overall, the curves indicate that the classifier performs above random chance for all roles but still struggles to clearly differentiate between them, reflecting the impact of class imbalance and overlapping feature patterns.
The Random Forest model shows improved class separability compared to logistic regression, with AUC values ranging from 0.60 to 0.84. Machine Learning achieves the strongest performance (AUC = 0.842), indicating the model can distinguish this role well despite its tiny sample size. Business Analytics and Data Analytics also show moderate discrimination, while Data Science remains the most challenging class, reflecting overlapping features and limited data representation.
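The per-class AUC values behind these curves follow the standard one-vs-rest recipe: binarize the labels and score each class's predicted-probability column separately. A sketch on synthetic 4-class data (stand-in for the four role categories):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 4-class stand-in for the role categories
X, y = make_classification(n_samples=1500, n_classes=4, n_informative=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC per class, as plotted in the ROC figures
y_bin = label_binarize(y_te, classes=clf.classes_)
aucs = {}
for col, cls in enumerate(clf.classes_):
    aucs[cls] = roc_auc_score(y_bin[:, col], proba[:, col])
    print(f"class {cls}: AUC = {aucs[cls]:.3f}")
```

An AUC of 0.5 corresponds to random chance, which is the reference point for the 0.58-0.84 range reported above.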
The comparison shows that Random Forest outperforms Logistic Regression, achieving both higher accuracy (85.6%) and a stronger F1 score (0.815). This indicates that Random Forest handles the complex and imbalanced role categories more effectively. Overall, while both models perform well, Random Forest delivers more balanced and reliable predictions across the dataset.
The feature importance results show that *experience years*, *average salary*, and *job duration* are the strongest predictors in distinguishing between BA, Data Science, ML, and Data Analytics roles. Remote status contributes modestly, while location-based features (state variables) have minimal impact, indicating that job role differences are driven more by skill level and job characteristics than by geography. Overall, the model relies most heavily on experience and salary patterns to differentiate job categories.
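Importances like these come from the fitted forest's `feature_importances_` attribute, which sums to 1 across features. A sketch with stand-in feature names mirroring those discussed above (the data itself is synthetic):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; names mirror the engineered features discussed above
feature_names = ["EXPERIENCE_YEARS", "AVG_SALARY", "DURATION_DAYS",
                 "IS_REMOTE", "STATE_CA", "STATE_NY"]
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=7)

clf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to 1
importances = pd.Series(clf.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importances.round(3))
```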
15 Conclusions and Recommendations
15.1 Summary of Findings
Our comprehensive analysis of the 2024 job market reveals several critical insights for job seekers in data analytics and related fields:
15.1.1 Market Demand
Enterprise Software Dominance: SAP Applications (1,114 postings) and Oracle Cloud (667 postings) lead the market, indicating strong demand for enterprise system expertise
Foundational Skills: Microsoft Office (637), Data Analysis (343), and SQL (357) remain essential across all role types
Emerging Specializations: Cloud computing, UX design, and specialized analytics tools show growing importance
15.1.2 Skill Requirements by Role
Our analysis identified distinct skill patterns across job categories:
Data Analysts: Strong emphasis on SQL (7.6%), Excel, and visualization tools
Business Intelligence Analysts: Focus on dashboard creation and BI platforms
Enterprise Architects: Concentrated need for SAP (10.5%) and enterprise system knowledge
Data Scientists: Requirements span machine learning, Python, and statistical analysis
15.1.3 Skill Gap Findings
Comparing team member skills against market requirements revealed no critical gaps, but moderate gaps in Cloud Computing and Data Analysis across the team.
15.3 Future Work
Real-Time Monitoring: Develop a dashboard for continuous market tracking
Industry-Specific Analysis: Deep dive into particular sectors (FinTech, Healthcare, etc.)
Network Analysis: Explore skill co-occurrence patterns and career pathway modeling
15.4 Final Thoughts
The 2024 job market for data analytics professionals presents significant opportunities for those who strategically develop their skill sets. While traditional foundations (SQL, Excel, data analysis) remain essential, the market increasingly rewards specialization in enterprise systems, cloud platforms, and advanced analytics tools.
Success in this market requires a balanced approach: maintaining strong fundamentals while developing expertise in high-demand specialized areas. The skill gap analysis methodology presented in this report provides a replicable framework for continuous career development assessment.
By leveraging these insights and recommendations, job seekers can position themselves competitively in a dynamic and evolving market landscape.